This dataset is from Kaggle. It was scraped entirely via the Goodreads API. The creator of the Kaggle page says the dataset was built to give a clear picture of which books are worth recommending, judging by the numbers.
According to Wikipedia, Goodreads is a social cataloging website that allows individuals to freely search its database of books, annotations, and reviews. Users can sign up and register books to generate library catalogs and reading lists. They can also create their own groups of book suggestions, surveys, polls, blogs, and discussions. The website’s offices are located in San Francisco. The company is owned by the online retailer Amazon. On July 23, 2013, Goodreads announced on their website their user base had grown to 20 million members, having doubled in close to 11 months.
As one of the world’s most influential reading sites, Goodreads provides a platform for people interested in talking about books. This Goodreads dataset covers the books listed on the Goodreads platform. It contains each book’s basic information, its rating, and its review count. The dataset was updated in 2019, and it is already tidy and clean.
Personally, I would like to use this dataset as a reference to make my own reading list.
Here are some questions that I want to answer by analyzing the dataset:
Who is the most productive writer?
Which book has the most pages?
Which book has the highest rating?
Which book do people discuss the most?
Is there any connection between the number of books a writer has written and the average rating of those books? In other words, is it true that the more books a writer writes, the higher his or her average rating is? Or are there other factors that affect the ratings?
The first thing I need to do is import the data and get an overview of the whole dataset. Though the dataset is very clean, some parts still need to be adjusted, so I delete the irrelevant columns and rename the remaining ones.
(To keep the report clean, I explain every step in comments inside the code instead of in the main text.)
Here is the result (only 150 rows are shown):
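The import-and-clean step can be sketched roughly as follows. This is a minimal Python illustration using only the standard library, not the code behind this report. The sample rows are invented, and the column names, the set of dropped columns, and the new names are all assumptions.

```python
import csv
import io

# Toy stand-in for the Kaggle CSV file. The rows are invented and the
# column names are assumptions about the file layout.
SAMPLE_CSV = """bookID,title,authors,average_rating,isbn,num_pages,ratings_count,text_reviews_count
1,Book A,Author X,4.10,0000000001,320,1500,90
2,Book B,Author Y,3.80,0000000002,210,40,3
"""

# Columns treated as irrelevant for this analysis (hypothetical choice).
DROP = {"bookID", "isbn"}

# More readable column names (hypothetical choice).
RENAME = {
    "title": "Title",
    "authors": "Author",
    "average_rating": "Average Rating",
    "num_pages": "Page Number",
    "ratings_count": "Ratings Count",
    "text_reviews_count": "Reviews Count",
}


def load_books(handle):
    """Read CSV rows, drop the irrelevant columns, and rename the rest."""
    books = []
    for row in csv.DictReader(handle):
        cleaned = {RENAME.get(key, key): value
                   for key, value in row.items()
                   if key not in DROP}
        books.append(cleaned)
    return books


books = load_books(io.StringIO(SAMPLE_CSV))
print(books[0])
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle. All values stay strings here, so numeric columns would still need converting before any calculation.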
Ranking by the number of books the authors have published.
As we can see from the graph, the most productive writer is Agatha Christie, with 69 books in the dataset. Stephen King comes second. Both of them are among my favorite writers, but I never have enough time to finish all of their masterpieces.
The top 20 writers each have more than 25 books in the dataset. However, ordinary authors still far outnumber the popular ones: more than half of the writers in this dataset have written only 1 or 2 books.
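The ranking above boils down to counting books per author. Here is a minimal Python sketch of that idea. The rows are placeholders rather than real data, and the `Author` field name is an assumption.

```python
from collections import Counter

# Placeholder rows; real data would come from the cleaned dataset.
books = [
    {"Title": "T1", "Author": "Author A"},
    {"Title": "T2", "Author": "Author A"},
    {"Title": "T3", "Author": "Author A"},
    {"Title": "T4", "Author": "Author B"},
    {"Title": "T5", "Author": "Author C"},
]

# Number of books per author.
books_per_author = Counter(book["Author"] for book in books)

# Authors sorted from most to fewest books.
ranking = books_per_author.most_common()

# Share of authors who have written only 1 or 2 books.
small_authors = sum(1 for n in books_per_author.values() if n <= 2)
share_small = small_authors / len(books_per_author)

for author, n in ranking:
    print(author, n)
print(f"Share of authors with 1 or 2 books: {share_small:.2f}")
```

`Counter.most_common(20)` would return only the top 20 authors, which is the slice shown in the graph.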
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    31.0   207.0   304.0   351.7   429.8  6576.0
Some books are excluded here because they have too few ratings. Only books with more than 10 ratings are considered.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   3.770   3.950   3.929   4.130   5.000
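A six-number summary like the one above can be reproduced along these lines. This is a Python sketch, assuming Python 3.8 or later for `statistics.quantiles`. The `(average rating, ratings count)` pairs are invented, and the choice of the `inclusive` method, which interpolates linearly between data points, is also an assumption.

```python
import statistics

# Invented (average_rating, ratings_count) pairs for illustration only.
rows = [
    (4.2, 100),
    (3.5, 5),
    (3.9, 50),
    (4.0, 11),
    (3.0, 10),
    (4.4, 2000),
    (3.7, 30),
]

# Keep only books with more than 10 ratings, as described in the text.
MIN_RATINGS = 10
ratings = sorted(avg for avg, count in rows if count > MIN_RATINGS)


def summarize(values):
    """Return min, quartiles, mean, and max of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


summary = summarize(ratings)
for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```

The same `summarize` helper (a hypothetical name) would work for page counts or any other numeric column.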
## .
##     Few Discussion  Normal Discussion    Lots Discussion
##               6485               3973               2884
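The three discussion levels come from splitting a count into bins. The sketch below shows one way to do that in Python. The cut-off values `LOW` and `HIGH` are hypothetical, since the thresholds used for the table above are not stated here. Using the number of text reviews as the measure of discussion is likewise an assumption.

```python
from collections import Counter

# Hypothetical thresholds; the report does not state the real cut-offs.
LOW = 10    # fewer reviews than this counts as "Few Discussion"
HIGH = 100  # this many reviews or more counts as "Lots Discussion"


def discussion_level(review_count):
    """Map a review count to one of three discussion levels."""
    if review_count < LOW:
        return "Few Discussion"
    if review_count < HIGH:
        return "Normal Discussion"
    return "Lots Discussion"


# Invented review counts for illustration only.
review_counts = [0, 5, 20, 150, 900, 45]

levels = Counter(discussion_level(n) for n in review_counts)
for label in ("Few Discussion", "Normal Discussion", "Lots Discussion"):
    print(label, levels[label])
```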
To analyze the relationship between the independent variables and the dependent variable, the tool I need is a correlation matrix. I use the DataExplorer package to present it.
As the matrix shows, Page Number and Ratings Count have a positive correlation of 0.86, which suggests that books with more pages tend to receive more ratings, or that people are more likely to rate thick books. Page Number is also positively correlated with Average Rating, but the coefficient of 0.18 indicates only a weak relationship. A correlation coefficient is not a p-value, so it cannot be compared against a 0.05 threshold, and correlation alone does not prove causation. The result suggests, rather than proves, that people tend to give slightly higher scores to books with more pages. The author’s reputation is also slightly associated with the average rating (correlation coefficients of 0.07 and -0.07), but this association is much weaker than the other two.
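For readers who want to see what sits behind a correlation matrix, here is a small Python sketch that computes Pearson correlation coefficients by hand. It is not the DataExplorer implementation. The numbers are invented, and the column names simply mirror the ones used in this report.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values for illustration only.
columns = {
    "Page Number": [100, 200, 300, 400],
    "Ratings Count": [10, 40, 20, 50],
    "Average Rating": [3.5, 3.9, 3.8, 4.1],
}

# Correlation of every column with every other column.
matrix = {
    a: {b: pearson(columns[a], columns[b]) for b in columns}
    for a in columns
}

for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

A coefficient close to 1 or -1 means a strong linear relationship, and a value near 0 means a weak one. Whether a coefficient is statistically significant is a separate question that needs its own test.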
Analysing a dataset from one of the biggest reading websites gives us a clearer picture of the books and their writers.
We discussed the factors that influence the average ratings of books on Goodreads. People love talking about river novels, especially highly rated ones. There are relationships among a book’s page count, its rating score, and the number of reviews written about it.
As I mentioned before, I want to make my own reading list based on this data. I will try to read some books by the top 10 writers I didn’t know before. However, I still think that choosing books always comes down to personal taste. How about you? Will you try to read some of these books after reading this report?
Thanks for reading and keep on reading!